Data Leakage via Access Patterns of Sparse Features in Deep Learning-based Recommendation Systems
Online personalized recommendation services are generally hosted in the cloud, where users query the cloud-based model to receive recommended items such as merchandise of interest or news feeds. State-of-the-art recommendation models rely on sparse and dense features to represent users' profile information and the items they interact with. Although sparse features account for 99% of the total model size, little attention has been paid to the potential information leakage through them. These sparse features are employed to track users' behavior, e.g., their click history and object interactions, and thus potentially carry each user's private information. Sparse features are represented as learned embedding vectors stored in large tables, and personalized recommendation is performed by using a specific user's sparse feature to index into these tables. Even with recently proposed methods that hide the computation happening in the cloud, an attacker in the cloud may still be able to track the access patterns to the embedding tables. This paper explores the private information that may be learned by tracking a recommendation model's sparse feature access patterns. We first characterize the types of attacks that can be carried out on sparse features in recommendation models in an untrusted cloud, and then demonstrate how each of these attacks leads to extracting users' private information or tracking users by their behavior over time.
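A minimal sketch of the lookup pattern the abstract describes, with table sizes, feature names, and item IDs purely illustrative: each sparse feature is a set of indices into an embedding table, so even if the embedding values themselves are hidden, the sequence of accessed indices reveals which items a user interacted with.

    import numpy as np

    # Hypothetical embedding table: one row per item ID (sizes are illustrative).
    NUM_ITEMS, EMBED_DIM = 10_000, 32
    embedding_table = np.random.randn(NUM_ITEMS, EMBED_DIM).astype(np.float32)

    def lookup(sparse_feature_ids):
        """Gather embedding rows for a user's sparse feature (e.g. click history)."""
        return embedding_table[sparse_feature_ids]   # rows indexed by item IDs

    # What an observer of memory accesses could log, even without seeing any values:
    access_log = []
    def observed_lookup(sparse_feature_ids):
        access_log.append(list(sparse_feature_ids))  # the access pattern itself leaks
        return lookup(sparse_feature_ids)

    user_click_history = [42, 977, 42, 5123]          # hypothetical user behavior
    _ = observed_lookup(user_click_history)
    print(access_log)  # [[42, 977, 42, 5123]] -- item IDs leak from indices alone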
Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
Mixture-of-Experts (MoE) models have gained popularity for achieving state-of-the-art performance on a wide range of tasks in computer vision and natural language processing. They effectively expand model capacity while incurring only a minimal increase in computation cost during training. However, deploying such models for inference is difficult due to their large size and complex communication patterns. In this work, we characterize two MoE workloads, namely Language Modeling (LM) and Machine Translation (MT), and identify their sources of inefficiency at deployment. We propose three optimization techniques to mitigate these inefficiencies, namely (1) dynamic gating, (2) expert buffering, and (3) expert load balancing. We show that dynamic gating improves maximum throughput by 6.21-11.23x for LM, 5.75-10.98x for the MT encoder, and 2.58-5.71x for the MT decoder. It also reduces memory usage by up to 1.36x for LM and up to 1.1x for MT. We further propose Expert Buffering, a new caching mechanism that keeps only hot, active experts in GPU memory while buffering the rest in CPU memory; this reduces static memory allocation by up to 1.47x. Finally, we propose a load balancing methodology that provides additional scalability to the workload.
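A minimal sketch of the Expert Buffering idea as described above, assuming PyTorch; the expert modules, capacity, and eviction policy here are hypothetical stand-ins, not the paper's implementation. Only the most recently used experts occupy device memory, while the rest stay parked in CPU memory and are copied in on demand.

    from collections import OrderedDict
    import torch, torch.nn as nn

    device = "cuda" if torch.cuda.is_available() else "cpu"   # falls back to CPU-only

    NUM_EXPERTS, HIDDEN, GPU_SLOTS = 8, 64, 2                  # illustrative sizes
    experts = [nn.Linear(HIDDEN, HIDDEN) for _ in range(NUM_EXPERTS)]  # parked in CPU memory

    resident = OrderedDict()                                   # expert_id -> module on the device

    def get_expert(expert_id):
        """Fetch an expert, keeping only GPU_SLOTS experts on the device (LRU eviction)."""
        if expert_id in resident:
            resident.move_to_end(expert_id)                    # mark as recently used
            return resident[expert_id]
        if len(resident) >= GPU_SLOTS:
            _, victim = resident.popitem(last=False)           # evict least-recently-used expert
            victim.to("cpu")
        resident[expert_id] = experts[expert_id].to(device)
        return resident[expert_id]

    # Tokens routed by a (hypothetical) gate: only touched experts occupy device memory.
    x = torch.randn(4, HIDDEN).to(device)
    for eid in [3, 3, 5, 1]:
        x = get_expert(eid)(x)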
Thermal-aware 3D Microarchitectural Floorplanning
Next-generation deep-submicron processor design will need to take many performance-limiting factors into consideration. Flip-flops are inserted to prevent global wire delay from growing nonlinearly, enabling deeper pipelines and higher clock frequencies. The move to 3D ICs will also likely be used to further shorten wirelength, but this will cause thermal issues to become a major bottleneck to performance improvement. In this paper we propose a floorplanning algorithm that takes into consideration both thermal issues and profile-weighted wirelength using mathematical programming. Our profile-driven objective improves performance by 20% over the wirelength-driven objective, while the thermal-driven objective reduces temperature by 24% on average over the profile-driven case.
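As a hedged illustration of the kind of objective such a floorplanner might optimize (the paper's exact formulation, weights, and constraints are not reproduced here), a profile-weighted wirelength term can be combined with a thermal penalty and minimized subject to non-overlap and 3D layer-assignment constraints:

    \min_{\text{floorplan}} \;\; \sum_{n \in \text{nets}} w_n \cdot \mathrm{WL}(n) \;+\; \lambda \, T_{\max}(\text{floorplan})

Here w_n is a hypothetical activity (profile) weight for net n, WL(n) is its estimated wirelength, T_max is the peak on-chip temperature estimated from the power density of the resulting floorplan, and lambda trades off the two objectives.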
GPU-based Private Information Retrieval for On-Device Machine Learning Inference
On-device machine learning (ML) inference can enable the use of private user
data on user devices without revealing them to remote servers. However, a pure
on-device solution to private ML inference is impractical for many applications
that rely on embedding tables that are too large to be stored on-device. In
particular, recommendation models typically use multiple embedding tables each
on the order of 1-10 GBs of data, making them impractical to store on-device.
To overcome this barrier, we propose the use of private information retrieval
(PIR) to efficiently and privately retrieve embeddings from servers without
sharing any private information. As off-the-shelf PIR algorithms are usually
too computationally intensive to directly use for latency-sensitive inference
tasks, we 1) propose novel GPU-based acceleration of PIR, and 2) co-design PIR
with the downstream ML application to obtain further speedup. Our GPU
acceleration strategy improves system throughput by more than over
an optimized CPU PIR implementation, and our PIR-ML co-design provides an over
additional throughput improvement at fixed model quality. Together,
for various on-device ML applications such as recommendation and language
modeling, our system on a single V100 GPU can serve up to queries per
second -- a throughput improvement over a CPU-based baseline --
while maintaining model accuracy.
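The paper accelerates cryptographic single-server PIR on GPUs; as a much simpler illustration of the retrieval guarantee only (explicitly not the paper's protocol), the sketch below shows a classic two-server XOR-based PIR over a hypothetical embedding table: neither server learns which row index was requested, yet the client recovers the row exactly.

    import numpy as np

    NUM_ROWS, DIM = 1024, 16                              # illustrative embedding-table shape
    table = np.random.randint(0, 256, size=(NUM_ROWS, DIM), dtype=np.uint8)  # replicated on both servers

    def client_queries(secret_index):
        """Split the desired row index into two random-looking bit-vector queries."""
        q1 = np.random.randint(0, 2, size=NUM_ROWS, dtype=np.uint8)
        q2 = q1.copy()
        q2[secret_index] ^= 1                             # q1 XOR q2 selects exactly one row
        return q1, q2

    def server_answer(query):
        """XOR together every row whose query bit is 1 (the server never sees the index)."""
        selected = table[query.astype(bool)]
        return np.bitwise_xor.reduce(selected, axis=0) if len(selected) else np.zeros(DIM, np.uint8)

    secret_index = 77
    q1, q2 = client_queries(secret_index)
    row = server_answer(q1) ^ server_answer(q2)           # client combines the two answers
    assert np.array_equal(row, table[secret_index])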
DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference
Neural personalized recommendation is the cornerstone of a wide collection of cloud services and products, constituting a significant fraction of the cloud infrastructure's compute demand. Improving the execution efficiency of neural recommendation therefore translates directly into infrastructure capacity savings. In this paper, we devise a novel end-to-end modeling infrastructure, DeepRecInfra, that adopts an algorithm-and-system co-design methodology to custom-design systems for recommendation use cases. Leveraging insights from this recommendation characterization, we propose a new dynamic scheduler, DeepRecSched, that maximizes latency-bounded throughput by taking into account the characteristics of inference query size and arrival patterns, recommendation model architectures, and underlying hardware systems. By doing so, system throughput is doubled across eight industry-representative recommendation models. Finally, design, deployment, and evaluation in an at-scale production datacenter show over 30% latency reduction across a wide variety of recommendation models running on hundreds of machines.
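A hedged sketch of the scheduling decision described above; the latency model, SLA target, and threshold-selection heuristic here are hypothetical stand-ins rather than DeepRecSched's actual policy. Large inference queries are split into smaller batches that run in parallel, and the split threshold is chosen as the largest one whose simulated tail latency still meets the latency target.

    import random

    SLA_MS = 100.0                                        # hypothetical tail-latency target

    def latency_ms(batch_size):
        """Hypothetical per-batch latency model (stand-in for profiled measurements)."""
        return 2.0 + 0.15 * batch_size

    def query_latency(query_size, split_threshold, parallel_lanes=4):
        """Split one query into batches of at most split_threshold items across parallel lanes."""
        batches = [split_threshold] * (query_size // split_threshold)
        if query_size % split_threshold:
            batches.append(query_size % split_threshold)
        lanes = [0.0] * parallel_lanes                    # greedy: assign to the least-loaded lane
        for b in batches:
            lanes[lanes.index(min(lanes))] += latency_ms(b)
        return max(lanes)

    def pick_threshold(query_sizes, candidates=(32, 64, 128, 256, 512)):
        """Choose the largest split threshold whose p99 latency still meets the SLA."""
        best = min(candidates)
        for t in sorted(candidates):
            lat = sorted(query_latency(q, t) for q in query_sizes)
            if lat[int(0.99 * (len(lat) - 1))] <= SLA_MS:
                best = t
        return best

    sampled = [random.randint(1, 600) for _ in range(1000)]  # hypothetical query-size sample
    print("chosen split threshold:", pick_threshold(sampled))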
Adaptive transaction scheduling for transactional memory systems
Transactional memory systems are expected to enable parallel programming at lower programming complexity, while delivering improved performance over traditional lock-based systems. Nonetheless, there are certain situations in which transactional memory systems can actually perform worse. Transactional memory systems outperform locks only when the executing workloads contain sufficient parallelism; when a workload lacks inherent parallelism, launching excessive transactions can severely degrade performance. These situations will become dominant in future workloads in which large-scale transactions are frequently executed.
In this thesis, we propose a new paradigm called adaptive transaction scheduling to address this issue. Based on parallelism feedback from applications, our adaptive transaction scheduler dynamically dispatches and controls the number of concurrently executing transactions. In our case study, we show that our low-cost mechanism not only guarantees that hardware transactional memory systems perform no worse than a single global lock, but also significantly improves performance for both hardware and software transactional memory systems.
M.S. thesis. Committee Chair: Lee, Hsien-Hsin; Committee Members: Blough, Douglas; Yalamanchili, Sudhakar.
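A minimal sketch of the adaptive idea described above, not the thesis's actual hardware or software mechanism: a toy controller throttles the number of concurrently running transactions based on recently observed abort rates, so low-parallelism phases fall back toward serialized execution. All thresholds and window sizes are hypothetical.

    class AdaptiveTxScheduler:
        """Toy controller: shrink the concurrency limit when aborts dominate, grow it otherwise."""

        def __init__(self, max_concurrency=32):
            self.limit = max_concurrency
            self.max_concurrency = max_concurrency
            self.commits = 0
            self.aborts = 0

        def report(self, committed):
            """Transactions report their outcome; the limit adapts to observed contention."""
            if committed:
                self.commits += 1
            else:
                self.aborts += 1
            window = self.commits + self.aborts
            if window >= 100:                              # adapt once per feedback window
                abort_rate = self.aborts / window
                if abort_rate > 0.5:
                    self.limit = max(1, self.limit // 2)   # heavy contention: throttle toward serial
                elif abort_rate < 0.1:
                    self.limit = min(self.max_concurrency, self.limit + 1)
                self.commits = self.aborts = 0

        def may_dispatch(self, running):
            """Dispatch a new transaction only if under the current adaptive concurrency limit."""
            return running < self.limit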
Kernel Formula Approach to the Universal Whitham Hierarchy
We derive the dispersionless Hirota equations of the universal Whitham
hierarchy from the kernel formula approach proposed by Carroll and Kodama.
Moreover, we verify the associativity equations in this hierarchy from the
dispersionless Hirota equations and give a realization of the associative
algebra with structure constants expressed in terms of residue formulas.
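For context, a standard hedged illustration (not the paper's specific derivation) of associativity equations of WDVV type for a free energy F, with structure constants given by third derivatives contracted with a nondegenerate pairing eta:

    c_{ij}^{\;k} \;=\; \sum_{l} \eta^{kl}\,
      \frac{\partial^{3} F}{\partial t_{i}\,\partial t_{j}\,\partial t_{l}},
    \qquad
    \sum_{k} c_{ij}^{\;k}\, c_{km}^{\;n}
      \;=\;
    \sum_{k} c_{jm}^{\;k}\, c_{ki}^{\;n}.

The second relation states that the algebra with multiplication e_i · e_j = sum_k c_{ij}^k e_k is associative; in the paper, analogous structure constants are realized through residue formulas.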